Abuzar Akhtar
B.tech CSE

Applied Statistics


Curve Fitting by the Method of Least Squares

Curve fitting is a process of finding a mathematical model that best approximates a set of data points. The method of least squares is a widely used technique for curve fitting, aiming to minimize the sum of the squared differences between the observed data points and the values predicted by the model.

Linear Curve Fitting

In linear curve fitting, we seek the best-fitting straight line that describes the relationship between the dependent variable $y$ and the independent variable $x$. The linear equation has the form:
$$y = mx + b$$
where:
  • $m$ is the slope of the line, representing the rate of change of $y$ with respect to $x$.
  • $b$ is the y-intercept, the value of $y$ when $x$ is 0.
Given a set of $n$ data points $(x_i, y_i)$, the objective of linear curve fitting is to find the values of $m$ and $b$ that minimize the sum of the squared residuals (the differences between the actual $y_i$ and the predicted values $mx_i + b$):
$$\text{minimize} \sum_{i=1}^{n} \left(y_i - (mx_i + b)\right)^2$$
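The objective above can be evaluated directly for any candidate line. The following is a minimal sketch, not a reference implementation: the function name `sum_squared_residuals` and the small data set are hypothetical and chosen only for illustration.

```python
def sum_squared_residuals(xs, ys, m, b):
    """Return the sum of squared residuals of the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying exactly on y = 2x + 1 (not from the notes).
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

exact = sum_squared_residuals(xs, ys, 2, 1)    # the true line leaves no error
shifted = sum_squared_residuals(xs, ys, 2, 2)  # every prediction is off by 1
```

Here `exact` is 0 and `shifted` is 4 (four residuals of size 1), so the least-squares criterion prefers the first line.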
Example:
Suppose we have the following set of data points representing the relationship between $x$ and $y$:

| $x$ | $y$ |
| :---: | :---: |
| 1 | 3 |
| 2 | 5 |
| 3 | 7 |
| 4 | 10 |
| 5 | 12 |

We want to find a linear equation of the form $y = mx + b$ that best fits this data.
Solution:

Step 1: Set up the equations

The linear equation to fit the data is $y = mx + b$. We want to find the values of $m$ and $b$ that minimize the sum of squared differences between the observed $y$-values and the $y$-values predicted by the equation.
For each data point $(x_i, y_i)$, a perfect fit would satisfy:
$$y_i = mx_i + b$$

Step 2: Formulate the system of equations

Substituting our five data points gives five equations in the two unknowns $m$ and $b$:
$$\begin{aligned} 1. &\quad 3 = m \times 1 + b \\ 2. &\quad 5 = m \times 2 + b \\ 3. &\quad 7 = m \times 3 + b \\ 4. &\quad 10 = m \times 4 + b \\ 5. &\quad 12 = m \times 5 + b \end{aligned}$$
Because the points are not collinear, no single pair $(m, b)$ satisfies all five equations exactly, so we look for the line that comes closest in the least-squares sense.

Step 3: Use the Method of Least Squares

The method of least squares finds the values of $m$ and $b$ that minimize the sum of the squares of the vertical distances (residuals) between the observed data points and the corresponding points on the fitted line.
To do this, we take the partial derivatives of the sum of squared residuals with respect to $m$ and $b$ and set them equal to zero. Solving the resulting equations gives the minimizing values of $m$ and $b$.
Let's define the sum of squared residuals (SSR) as:
$$SSR = \sum_{i=1}^{5} (y_i - (mx_i + b))^2$$
Taking partial derivatives and setting them equal to zero:
$$\frac{\partial SSR}{\partial m} = 0$$
$$\frac{\partial SSR}{\partial b} = 0$$
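To make this step explicit, the two conditions can be expanded by differentiating the SSR term by term. This is a short derivation sketch using the same symbols as above.

```latex
\begin{align*}
\frac{\partial SSR}{\partial m}
  &= -2\sum_{i=1}^{5} x_i\,\bigl(y_i - (mx_i + b)\bigr) = 0
  &&\Longrightarrow\quad m\sum x_i^2 + b\sum x_i = \sum x_i y_i \\
\frac{\partial SSR}{\partial b}
  &= -2\sum_{i=1}^{5} \bigl(y_i - (mx_i + b)\bigr) = 0
  &&\Longrightarrow\quad m\sum x_i + 5b = \sum y_i
\end{align*}
```

For the example data, $\sum x_i = 15$, $\sum y_i = 37$, $\sum x_i y_i = 134$ and $\sum x_i^2 = 55$, so the two equations to solve are $55m + 15b = 134$ and $15m + 5b = 37$.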

Step 4: Solve for m m mmm and b b bbb

Solving the two equations above simultaneously gives the values of $m$ and $b$ that minimize the sum of squared residuals:
$$m = 2.3$$
$$b = 0.5$$
So, the best-fitting linear equation is: $y = 2.3x + 0.5$.

Step 5: Plot the linear curve on a graph

To visualize the linear curve fit, we can plot it on a graph along with the original data points.
In conclusion, by using the method of least squares, we found the equation $y = 2.3x + 0.5$ that best fits the given data points.

Solving Linear Curve Fitting by Simplified Approach

Let's consider a set of data points representing the relationship between an independent variable $x$ and a dependent variable $y$.
The general equation for a linear function is:
$$y = mx + b$$
where $m$ is the slope and $b$ is the y-intercept of the line.
Step 1: Calculate the necessary sums
Calculate the following sums from the given data:
  • $\sum x$: The sum of all $x$-values in the dataset.
  • $\sum y$: The sum of all $y$-values in the dataset.
  • $\sum xy$: The sum of the products of each $x$-value and its corresponding $y$-value.
  • $\sum x^2$: The sum of the squares of all $x$-values in the dataset.
Step 2: Calculate the slope ($m$) and y-intercept ($b$)
Using the following formulas, we can calculate the slope ($m$) and y-intercept ($b$) for the best-fitting line:
$$m = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$$
$$b = \frac{\sum y - m \sum x}{n}$$
where $n$ is the number of data points.
Step 3: Interpretation and plotting
The calculated values of $m$ and $b$ represent the slope and y-intercept of the best-fitting line that approximates the data. The best-fitting linear equation is then $y = mx + b$.
To visualize the linear curve fit, we can plot the line $y = mx + b$ on a graph along with the original data points.
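The steps above translate almost line for line into code. The following is a minimal sketch under stated assumptions: the function name `fit_line` and the sample data are hypothetical, and the guard against a zero denominator is an addition that the formulas themselves do not mention.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept computed from the four sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        # Happens when all x-values are identical: the line would be vertical.
        raise ValueError("all x-values are identical; the slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - m * sum_x) / n
    return m, b


# Hypothetical data lying exactly on y = 3x - 1 (not from the notes).
m, b = fit_line([0, 1, 2, 3], [-1, 2, 5, 8])
```

Because the hypothetical points lie exactly on a line, the function recovers its slope 3 and intercept -1.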
Example:
Suppose we have the following set of data points representing the relationship between $x$ and $y$:

| $x$ | $y$ |
| :---: | :---: |
| 1 | 3 |
| 2 | 5 |
| 3 | 7 |
| 4 | 10 |
| 5 | 12 |

We want to find a linear equation of the form $y = mx + b$ that best fits this data.
Solution:
Step 1: Calculate the necessary sums
Calculate the following sums from the given data:
$$\sum x, \quad \sum y, \quad \sum xy, \quad \sum x^2, \quad \text{and} \quad \sum y^2$$
(The last sum, $\sum y^2$, does not appear in the formulas for $m$ and $b$.)
For the given data:
$$\sum x = 1 + 2 + 3 + 4 + 5 = 15$$
$$\sum y = 3 + 5 + 7 + 10 + 12 = 37$$
$$\sum xy = (1 \times 3) + (2 \times 5) + (3 \times 7) + (4 \times 10) + (5 \times 12) = 134$$
$$\sum x^2 = 1 + 4 + 9 + 16 + 25 = 55$$
$$\sum y^2 = 9 + 25 + 49 + 100 + 144 = 327$$
Step 2: Calculate the slope (m) and y-intercept (b)
The formula for the slope ($m$) is:
$$m = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$$
The formula for the y-intercept ($b$) is:
$$b = \frac{\sum y - m \sum x}{n}$$
where $n$ is the number of data points, which in this case is 5.
Using the values from Step 1, we get:
$$m = \frac{5 \times 134 - 15 \times 37}{5 \times 55 - 15^2} = \frac{115}{50} = 2.3$$
$$b = \frac{37 - 2.3 \times 15}{5} = \frac{2.5}{5} = 0.5$$
So, the best-fitting linear equation is: $y = 2.3x + 0.5$.
Step 3: Plot the linear curve on a graph
To visualize the linear curve fit, we can plot it on a graph along with the original data points.
In conclusion, by using the method of least squares, we found the equation $y = 2.3x + 0.5$ that best fits the given data points.
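As a sanity check on the arithmetic, the sum of squared residuals can be compared at the least-squares line for this data (slope $115/50 = 2.3$, intercept $2.5/5 = 0.5$) and at nearby lines. This is a minimal sketch: the helper name `ssr` and the step size of 0.1 are arbitrary choices for illustration.

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 10, 12]


def ssr(m, b):
    """Sum of squared residuals of y = m*x + b over the example data."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


best = ssr(2.3, 0.5)

# Nudge the slope and intercept in every direction; each neighbouring line
# should have a larger SSR than the least-squares line.
neighbours = [
    ssr(2.3 + dm, 0.5 + db)
    for dm in (-0.1, 0.0, 0.1)
    for db in (-0.1, 0.0, 0.1)
    if (dm, db) != (0.0, 0.0)
]
is_minimum = all(value > best for value in neighbours)
```

The residuals at the fitted line are 0.2, -0.1, -0.4, 0.3 and 0.0, so `best` is about 0.30. Raising the intercept to 0.6, for instance, increases the SSR to about 0.35.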

Polynomial Curve Fitting

Polynomial curve fitting involves fitting a polynomial equation of degree $n$ to the data points. The general form of a polynomial of degree $n$ is:
$$y = a_nx^n + a_{n-1}x^{n-1} + \ldots + a_1x + a_0$$
where $a_n, a_{n-1}, \ldots, a_0$ are the coefficients to be determined.
For polynomial curve fitting, the goal is to find the values of the coefficients $a_n, a_{n-1}, \ldots, a_0$ that minimize the sum of the squared residuals over the $N$ data points $(x_i, y_i)$:
$$\text{minimize} \sum_{i=1}^{N} \left(y_i - \left(a_nx_i^n + a_{n-1}x_i^{n-1} + \ldots + a_1x_i + a_0\right)\right)^2$$
Here $N$ denotes the number of data points, to keep it distinct from the degree $n$ of the polynomial.
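One way to carry out this minimization is to set the partial derivative with respect to each coefficient to zero, which gives a linear system (the normal equations) in the coefficients. The following is a minimal sketch of that approach in pure Python. The function name `fit_polynomial`, the use of Gaussian elimination, and the sample data are assumptions for illustration. For high degrees the normal equations become numerically ill-conditioned, so this is not meant as a production routine.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [a_0, a_1, ..., a_degree] (lowest power first)."""
    size = degree + 1
    # Normal equations A a = r, with A[j][k] = sum(x^(j+k)) and r[j] = sum(y * x^j).
    A = [[sum(x ** (j + k) for x in xs) for k in range(size)] for j in range(size)]
    r = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(size)]

    # Gaussian elimination with partial pivoting on the augmented matrix.
    M = [row + [rhs] for row, rhs in zip(A, r)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda i: abs(M[i][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("not enough distinct x-values for this degree")
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, size):
            factor = M[row][col] / M[col][col]
            for k in range(col, size + 1):
                M[row][k] -= factor * M[col][k]

    # Back substitution, starting from the last row.
    coeffs = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(M[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (M[row][size] - tail) / M[row][row]
    return coeffs


# Hypothetical data lying exactly on y = x^2 + 1 (not from the notes).
coeffs = fit_polynomial([-2, -1, 0, 1, 2], [5, 2, 1, 2, 5], degree=2)
```

For the hypothetical data the result is $a_0 = 1$, $a_1 = 0$, $a_2 = 1$. With `degree=1` the same function reduces to the straight-line fit of the previous sections.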

Nonlinear Curve Fitting

In cases where the relationship between the variables is nonlinear, a general nonlinear equation is used for curve fitting. Unlike the linear case, the best-fitting parameters usually cannot be written in closed form, so they are estimated from the data numerically.
For nonlinear curve fitting, we have a model with parameters $\theta$ (e.g., $\theta_1, \theta_2, \ldots$). The goal is to find the values of $\theta$ that minimize the sum of the squared residuals:
$$\text{minimize} \sum_{i=1}^{n} \left(y_i - f(x_i, \theta)\right)^2$$
where $f(x_i, \theta)$ is the function that relates $x_i$ and $y_i$ through the parameters $\theta$.
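The notes do not name a particular algorithm for this minimization. As one illustration, the sketch below applies the Gauss-Newton iteration to a hypothetical one-parameter model $f(x, \theta) = e^{\theta x}$. The model, the function name `gauss_newton_exp`, the starting value and the iteration count are all assumptions made for the example. Gauss-Newton is only one of several possible methods, and it can fail to converge from a poor starting value.

```python
import math


def gauss_newton_exp(xs, ys, theta, iterations=20):
    """Fit the one-parameter model f(x, theta) = exp(theta * x) by Gauss-Newton."""
    for _ in range(iterations):
        residuals = [y - math.exp(theta * x) for x, y in zip(xs, ys)]
        # Derivative of f with respect to theta at each data point.
        jacobian = [x * math.exp(theta * x) for x in xs]
        numerator = sum(j * r for j, r in zip(jacobian, residuals))
        denominator = sum(j * j for j in jacobian)
        if denominator == 0:
            break  # every derivative is zero, so no update is possible
        theta += numerator / denominator
    return theta


# Hypothetical data generated exactly from theta = 0.5 (not from the notes).
xs = [0, 1, 2, 3, 4]
ys = [math.exp(0.5 * x) for x in xs]
theta_hat = gauss_newton_exp(xs, ys, theta=0.3)
```

Each pass linearizes the model around the current estimate and solves the resulting linear least-squares problem for the update. Starting from 0.3, the estimate settles at the generating value 0.5 within a few iterations.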

Introduction to Hypothesis Testing

Hypothesis testing is a fundamental concept in statistics that allows us to make decisions about a population based on sample data. It helps us determine if there is enough evidence to support or reject a claim (hypothesis) about a population parameter.

Key Components

  1. Null Hypothesis ($H_0$): This is the initial assumption that there is no significant difference or effect. It represents the status quo and is usually denoted by $H_0$.
  2. Alternative Hypothesis ($H_a$): This is the claim or statement we want to investigate and determine if there is enough evidence to support. It opposes the null hypothesis and is denoted by $H_a$.

Types of Hypotheses

  • One-Tailed (or One-Sided) Hypothesis: This type of hypothesis test looks for an effect in one specified direction. It is denoted as $H_a: \mu > \mu_0$ (right-tailed) or $H_a: \mu < \mu_0$ (left-tailed), where $\mu$ is the population mean and $\mu_0$ is the hypothesized value.
  • Two-Tailed Hypothesis: This type of hypothesis test is more general and looks for any significant difference, regardless of the direction. It is denoted as $H_a: \mu \neq \mu_0$.

Steps for Hypothesis Testing

  1. Formulate Hypotheses: Define the null and alternative hypotheses based on the research question.
  2. Select Significance Level ($\alpha$): The significance level is the probability of rejecting the null hypothesis when it is actually true. Common choices include 0.05 (5%) or 0.01 (1%).
  3. Choose a Test Statistic: The choice of test statistic depends on the type of data and the specific hypothesis test.
  4. Calculate the Test Statistic: Using the sample data, compute the value of the test statistic.
  5. Determine the Critical Region: This is the region of extreme values of the test statistic that leads to the rejection of the null hypothesis.
  6. Compare Test Statistic with Critical Value: If the test statistic falls within the critical region, reject the null hypothesis. Otherwise, fail to reject the null hypothesis.
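Steps 5 and 6 can be sketched in code for a test statistic that follows the standard normal distribution. This is an illustrative sketch only: the function name `z_critical_region` and its `tail` argument are hypothetical, and `NormalDist` comes from Python's standard `statistics` module (Python 3.8 or later).

```python
from statistics import NormalDist


def z_critical_region(alpha, tail="two"):
    """Return a function that tells whether a z statistic lies in the critical region."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        cutoff = standard_normal.inv_cdf(1 - alpha / 2)
        return lambda z: abs(z) > cutoff
    if tail == "right":
        cutoff = standard_normal.inv_cdf(1 - alpha)
        return lambda z: z > cutoff
    if tail == "left":
        cutoff = standard_normal.inv_cdf(alpha)
        return lambda z: z < cutoff
    raise ValueError("tail must be 'two', 'right', or 'left'")


reject = z_critical_region(alpha=0.05, tail="two")
decision_a = reject(2.5)  # True: beyond the two-tailed cutoff of about 1.96
decision_b = reject(1.0)  # False: outside the critical region
```

A two-tailed test splits $\alpha$ between both tails, which is why its cutoff (about 1.96 at $\alpha = 0.05$) is larger than the one-tailed cutoff (about 1.645).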

Common Test Statistics

  • Z-Test: Used when the population standard deviation ($\sigma$) is known, and the sample size is sufficiently large.

    Formula: $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
  • T-Test: Used when the population standard deviation ($\sigma$) is unknown or the sample size is small ($n < 30$).

    Formula (for one-sample t-test): $t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$, where $s$ is the sample standard deviation.
  • Chi-Square Test ($\chi^2$): Used for categorical data to test independence or goodness of fit.

    Formula: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where $O$ is the observed frequency and $E$ is the expected frequency.
  • F-Test: Used to compare the variances of two or more samples.

    Formula (for two-sample F-test): $F = \frac{s_1^2}{s_2^2}$, where $s_1^2$ and $s_2^2$ are the sample variances.
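The four formulas above can be written as small functions. This is a minimal sketch: the function names and the sample numbers are hypothetical. It relies on `mean`, `stdev` and `variance` from Python's standard `statistics` module, where `stdev` and `variance` use the $n - 1$ (sample) denominator.

```python
import math
from statistics import mean, stdev, variance


def z_statistic(sample_mean, mu, sigma, n):
    """Z = (sample mean - mu) / (sigma / sqrt(n))."""
    return (sample_mean - mu) / (sigma / math.sqrt(n))


def t_statistic(sample, mu):
    """One-sample t = (sample mean - mu) / (s / sqrt(n))."""
    n = len(sample)
    return (mean(sample) - mu) / (stdev(sample) / math.sqrt(n))


def chi_square_statistic(observed, expected):
    """Chi-square = sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def f_statistic(sample_1, sample_2):
    """Two-sample F = s1^2 / s2^2."""
    return variance(sample_1) / variance(sample_2)


# Hypothetical inputs, chosen so the results are easy to check by hand.
z = z_statistic(sample_mean=52, mu=50, sigma=10, n=25)  # 2 / (10 / 5)
t = t_statistic([4, 6, 8], mu=5)                        # 1 / (2 / sqrt(3))
chi2 = chi_square_statistic([18, 22, 20], [20, 20, 20])  # (4 + 4 + 0) / 20
f = f_statistic([1, 2, 3], [2, 4, 6])                   # 1 / 4
```

Each statistic is then compared with a critical value from the matching distribution (standard normal, Student's t, chi-square, or F).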
Remember, hypothesis testing helps us draw conclusions based on evidence from sample data. Always interpret the results in the context of the problem and use statistical significance as a guide for decision-making.
Question: A company claims that the average response time of their customer support team is 10 minutes. To test this claim, a random sample of 36 customer support interactions was taken, and the average response time was found to be 12 minutes, with a sample standard deviation of 3 minutes. Conduct a hypothesis test using a two-tailed Z-test at a 5% significance level to determine if there is enough evidence to reject the company's claim.
Solution:
Step 1: Formulate Hypotheses
Null Hypothesis ($H_0$): The average response time of the customer support team is equal to 10 minutes.
$$H_0: \mu = 10$$
Alternative Hypothesis ($H_a$): The average response time of the customer support team is not equal to 10 minutes.
$$H_a: \mu \neq 10$$
Step 2: Select Significance Level
In this case, the significance level ($\alpha$) is given as 0.05 (5%).
Step 3: Choose a Test Statistic
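The population standard deviation is unknown, but the sample is large ($n = 36 \geq 30$), so we use the sample standard deviation $s = 3$ as an estimate of $\sigma$ and apply the Z-test for the mean, as the question specifies.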
Step 4: Calculate the Test Statistic
The formula for the Z-test is:
$$Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}}$$
where:
  • $\bar{X}$ = Sample mean (12 minutes)
  • $\mu$ = Population mean under the null hypothesis (10 minutes)
  • $\sigma$ = Population standard deviation (unknown; estimated by the sample standard deviation, 3 minutes)
  • $n$ = Sample size (36)
Step 5: Determine the Critical Region
Since this is a two-tailed test, we need to find the critical Z-values for a 5% significance level.
For a two-tailed test at 5% significance level, we divide the significance level by 2 to get 2.5% on each tail. Using a Z-table or calculator, we find the critical Z-values to be approximately -1.96 and +1.96.
Step 6: Compare Test Statistic with Critical Value
Calculate the test statistic:
$$Z = \frac{12 - 10}{\frac{3}{\sqrt{36}}} = \frac{2}{\frac{3}{6}} = 4$$
Since the calculated Z-value (4) exceeds the upper critical value (+1.96), it falls in the critical region, so we reject the null hypothesis.
Step 7: Conclusion
There is enough evidence to reject the company's claim that the average response time of their customer support team is 10 minutes. The data suggests that the average response time is significantly different from 10 minutes.
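The calculation in this example can be reproduced in a few lines. This is a minimal sketch: the variable names are chosen for illustration, and the critical value is taken from `NormalDist` in Python's standard `statistics` module instead of a printed Z-table.

```python
import math
from statistics import NormalDist

sample_mean = 12.0   # minutes
claimed_mean = 10.0  # minutes, the value of mu under H0
sample_sd = 3.0      # minutes, used in place of the unknown sigma
n = 36
alpha = 0.05

# Test statistic: distance of the sample mean from the claim, in standard errors.
z = (sample_mean - claimed_mean) / (sample_sd / math.sqrt(n))

# Two-tailed critical value: alpha / 2 in each tail of the standard normal.
critical = NormalDist().inv_cdf(1 - alpha / 2)

reject_null = abs(z) > critical
```

Here `z` is 4.0 and `critical` is about 1.96, so `reject_null` is `True`, matching the conclusion above.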